December 14, 2018

Abstract

Movies are one of the most important forms of entertainment in daily life. I found a dataset of roughly 7,000 movies scraped from the Internet Movie Database (IMDb). I am interested in several topics: genres, movie budget and gross, the geographic distribution of movies, rating distribution, and IMDb scores and votes.

I examined the relationships between these variables and have several findings. First, the movie industry is becoming more and more important. Second, a high budget does not guarantee high earnings. Third, the USA can be regarded as the world's movie center. Fourth, more R-rated movies were made than movies with any other rating. Fifth, the score distribution does not follow a normal distribution. Finally, the votes data follow Benford's law, which suggests that the scores on IMDb can be trusted.

Introduction:

I. Background

The Internet Movie Database is one of the most useful movie websites for deciding which movies are worth watching.

It covers films, television programs, home videos, video games, and internet streams, including cast, production crew and personnel biographies, plot summaries, trivia, and fan reviews and ratings.

As a very well-known website, IMDb is a good subject for an analysis of movies.

II. Data source

I obtained a dataset of nearly 7,000 movies from a Kaggle data challenge. The data come from imdb.com, where the dataset's author scraped them.

The dataset has the following columns: budget, company, country, director, genre, gross, name, rating, released, runtime, score, star, votes, writer, and year.

EDA:

Attitudes toward genres

  • Traditional themes were falling: see the curves for Drama, Musical, Romance, and Thriller.
  • Science themes were rising: after 2000, more and more sci-fi and superhero movies were made.

Big companies' budget planning

  • Linear relationship between time and budget:

The plot shows that budgets rise over time.
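A linear trend like this one can be checked with an ordinary least-squares fit. The sketch below is only an illustration: the `years` and `budgets` values are hypothetical placeholders, not figures from the dataset, and `fit_line` is a helper name I introduce here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical example: average budget (in millions) for three years.
years = [1990, 2000, 2010]
budgets = [20.0, 30.0, 40.0]
slope, intercept = fit_line(years, budgets)
print(slope, intercept)
```

A positive slope would match the rising budgets seen in the plot.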

Effect of high-budget planning

  • Risk and earnings both exist

How do big companies perform?

  • Big companies have methods to stay rich

Which country is the movie center?

  • Hollywood: Hollywood and the biggest movie companies are in America.
  • Superhero: audiences are attracted by the superhero theme.

Rating ratio

a. Do audiences' tastes vary by year?

The density plot shows a decreasing trend in the density of PG. From 1986 to 2005, PG-13 shows an increasing trend, followed by a decreasing trend from 2006 to 2016. The R rating stays the same.

Rating ratio

b. People's attitudes toward the three ratings

The plots show the voter distribution for the three ratings. More people are willing to vote for PG-13 movies.

Which movie length earns the most?

  • Movies with runtimes from 125 to 150 minutes have significantly higher earnings than movies with other runtimes.

Check Distribution

I. Normal Distribution

Plot the distribution

  • The score distribution is skewed.
  • The skewness is -0.63 (left-skewed).
  • The kurtosis is 3.97 (leptokurtic).

## [1] -0.634
## [1] 3.973
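The kurtosis of 3.97 is called leptokurtic, which suggests it is measured against the normal-distribution value of 3 rather than as excess kurtosis. Under that assumption, a minimal sketch of the moment-based definitions looks like this. The `scores` values are hypothetical, and this is not necessarily the exact routine that produced the output above.

```python
from math import fsum


def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    n = len(values)
    mean = fsum(values) / n
    return fsum((v - mean) ** k for v in values) / n


def skewness(values):
    """Third standardized moment; negative means a longer left tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth standardized moment (not excess); a normal distribution gives 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


# Hypothetical scores with a long left tail.
scores = [2.0, 6.0, 7.0, 7.0, 8.0]
print(skewness(scores), kurtosis(scores))
```

A negative skewness means the low scores form the long tail, which matches the left skew reported for the IMDb scores.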

II. Implication for score distribution

The skewness arises because people are more willing to give comparatively high scores. Intuitively, people treat 5.0 as a bad score rather than an average one, which pulls the distribution toward the high end.

Benford's Law

I. Introduction

Benford's law is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. The second-order test looks at relationships and patterns in data and is based on the digits of the differences between amounts that have been sorted from smallest to largest (ordered). The digit patterns of the differences are expected to closely approximate the digit frequencies of Benford's Law. The summation test looks for excessively large numbers in the data.
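For reference, the standard statement of the law gives the expected probability of a leading digit (or leading digit group) $d$ as follows. The symbol $d$ is my notation, not taken from the slides.

```latex
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right),
\qquad d \in \{1, \dots, 9\} \text{ for the first digit},
\quad d \in \{10, \dots, 99\} \text{ for the first two digits.}
```

For example, $P(1) = \log_{10} 2 \approx 0.301$ for a single leading digit, and $P(10) = \log_{10} 1.1 \approx 0.041$ for the first two digits.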

II. Check Votes Data

The first two digits of the votes follow Benford's distribution. Only a few two-digit groups occur slightly more often than Benford's distribution predicts. This also shows in the suspects_ranked dataset. The biggest absolute difference is 32, which is small. In conclusion, there is no significant discrepancy.

The highest chi-square value between the observed and expected two-digit frequencies is 4, which is small. This indicates that the observed two-digit frequencies are close to the expected ones.
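To make the comparison concrete, here is a rough sketch of the per-group chi-square terms for the first two digits. It assumes the value of 4 refers to the largest single term, that is (observed - expected)^2 / expected for one two-digit group. The function names and the sample `votes` list are hypothetical.

```python
from collections import Counter
from math import log10


def first_two_digits(n):
    """Leading two digits of an integer that is at least 10."""
    return int(str(n)[:2])


def benford_expected(total):
    """Expected count for each two-digit group 10..99 under Benford's law."""
    return {d: total * log10(1 + 1 / d) for d in range(10, 100)}


def chi_square_terms(votes):
    """Per-group chi-square contribution: (observed - expected)^2 / expected."""
    usable = [v for v in votes if v >= 10]  # values below 10 have no second digit
    observed = Counter(first_two_digits(v) for v in usable)
    expected = benford_expected(len(usable))
    return {
        d: (observed.get(d, 0) - expected[d]) ** 2 / expected[d]
        for d in range(10, 100)
    }


# Hypothetical vote counts, not taken from the dataset.
votes = [105, 1023, 1874, 234, 57, 8812, 410, 3120]
terms = chi_square_terms(votes)
print(max(terms.values()))
```

Small terms across all 90 groups mean the observed counts stay close to the Benford expectation.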

III. Suspicious data

Using the 'getSuspects' function, we can see the suspicious observations in the original data. There are 393 suspicious observations out of 6338.

We can also see the suspects ranking. The highest absolute difference is only 32.35, which is small, so I think the votes follow Benford's distribution.

digits absolute.diff
10 32.35
14 26.91
12 24.32
27 18.90
23 18.85
20 14.70
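A ranking like the one above can be sketched by sorting the two-digit groups by the absolute difference between observed and expected counts, then pulling the observations that start with the top groups. This is only loosely modelled on what 'getSuspects' appears to do. The `how_many` parameter, the function names, and the sample data are my assumptions, not the package's confirmed behaviour.

```python
from collections import Counter
from math import log10


def leading_two(n):
    """Leading two digits of an integer that is at least 10."""
    return int(str(n)[:2])


def rank_by_absolute_diff(votes):
    """Two-digit groups sorted by |observed count - Benford expected count|."""
    usable = [v for v in votes if v >= 10]
    observed = Counter(leading_two(v) for v in usable)
    total = len(usable)
    diffs = {
        d: abs(observed.get(d, 0) - total * log10(1 + 1 / d))
        for d in range(10, 100)
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


def suspects(votes, how_many=2):
    """Observations whose leading two digits are among the top-ranked groups."""
    top = {d for d, _ in rank_by_absolute_diff(votes)[:how_many]}
    return [v for v in votes if v >= 10 and leading_two(v) in top]


# Hypothetical vote counts with too many values starting with "10".
votes = [105, 1023, 10, 1099, 234, 57, 8812, 410]
print(rank_by_absolute_diff(votes)[:3])
print(suspects(votes, how_many=1))
```

With real data, a small top difference relative to the number of observations is what supports the conclusion that nothing looks manipulated.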

IV. Implication

The votes follow Benford's distribution.

This means no significant cheating behavior was found in the vote data of IMDb movies.

Thank you

Any questions?